Riemannian metrics for neural networks 



oo 

o 



C3 



Yann Ollivier 
May 3, 2013 



O ■ Abstract 

(N ■ 

We describe four algorithms for neural network training, each adapted 
to different scalability constraints. These algorithms are mathemati- 
cally principled and invariant under a number of transformations in 
data and network representation, from which performance is thus indc- 
£\j . pendent. These algorithms are obtained from the setting of differential 

geometry, and are based on either the natural gradient using the Fisher 
information matrix, or on Hessian methods, scaled down in a specific 
way to allow for scalability while keeping some of their key mathemat- 
ical properties. 

CO 

The most standard way to train neural networks, backpropagation, has 
several known shortcomings. Convergence can be quite slow. Backpropa- 
gation is sensitive to data representation: for instance, even such a simple 
OO ! operation as exchanging 0's and l's on the input layer will affect perfor- 

mance (Figure [T]), because this amounts to changing the parameters (weights 
and biases) in a non-trivial way, resulting in different gradient directions in 
parameter space, and better performance with l's than with 0's. (In the 
related context of restriced Boltzmann machines, it has been found that the 
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£T) ■ standard training technique by gradient ascent favors setting hidden units 

to 1, for very much the same reason |AAHOl"i] Section 5].) This specific 
phenomenon disappears if, instead of the logistic function, the hyperbolic 
tangent is used as the activation function. Scaling also has an effect on 
performance: for instance, a common recommendation [LB OM96| is to use 
1.7159 tanh(2:r/3) instead of just tanh(x) as the activation function. 

It would be interesting to have algorithms whose performance is insen- 
sitive to particular choices such as scaling factors in network construction, 
parameter encoding or data representation. Such invariance properties mean 
more robustness for an algorithm: good performance on a particular prob- 
lem presumably indicates good performance over a whole class of problems 
equivalent to the first one by simple (e.g., affine) transformations. 

Ways exist to deal with these issues, such as Hessian methods or the 
natural gradient, which are invariant (and thus preserve performance) over 
a wide class of changes in the representation of the data and of the network. 
However, these are generally not scalable (the cost of maintaining the whole 
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Figure 1: Backpropagation learns better with l's. A neural network with 
two layers of size 50 (no hidden layer) is trained to reproduce its input. A 
random binary sequence x of length 50 with 75% of l's is generated. The 
network is trained on the input-output pair (x, x) for 100 backpropagation 
steps (learning rate 0.01). The experiment is repeated on the input-output 
pair (1 — x, 1 — x). In both cases all initial weights are set to 0. 



Hessian or Fisher information matrix is prohibitive for large networks). The 
approximations made to ensure their scalability, such as keeping only di- 
agonal terms or making small-rank approximations, break these invariance 
properties. 

We present four algorithms which share a common inspiration from Rie- 
mannian metrics and are adapted to four different scalability constraints. 
The most lightweight has the same complexity as backpropagation, up to 
a constant factor. The heaviest requires that the network is sparsely con- 
nected (the average number of units influencing a given unit must not be 
too large) and that the output layer is not too large. This latter condition 
is typically fulfilled in classification tasks. 

The unitwise natural gradient is a scaled-down version of Amari's nat- 
ural gradient [ Ama9 8] in which the blocks of parameters afferent to each 
unit are treated independently, thus dealing, at each unit, with a square 
matrix indexed by the parameters afferent to this unit. This method has 
been proposed as far back as |K ur94| to train neural networks; however the 
presentation in |Kur94j is limited to networks with only one hidden layer, 
because it relies on an explicit symbolic computation of entries of the Fisher 
matrix. Here Proposition [2H allows for an efficient computation of the ex- 
act Fisher information matrix in arbitrary neural networks, by doing n out 
distinct backpropagations for each sample in the dataset. As a result, the 
unitwise natural gradient is adapted to situations where both the connectiv- 
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ity of the network and the output layer are reasonably small. 

The backpropagated metric gradient is not altogether new either: it can 
be described as a blockwise quasi-Hessian method in which several approxi- 
mations (Gauss-Newton and neglecting cross-unit terms) are used. However, 
we describe it in an intrinsic way: it stems from a well-defined backpropa- 
gated metric on parameter space, in which no approximations are involved. 
Invariance properties follow immediately from this viewpoint. It is adapted 
to networks with reasonably small connectivity but output layers of arbitrary 
size. 

The quasi- diagonal natural gradient and quasi- diagonal backpropagated 
metric gradient apply a quasi-diagonal reduction to these two algorithms, 
which removes the quadratic dependency on connectivity at each unit. This 
is done in a specific way to keep some (but not all) of the invariance prop- 
erties, such as insensitivity to using sigmoid or 1.7159 tanh(2x/3). The 
quasi-diagonal natural gradient still requires that the output layer is not too 
large, whereas the quasi-diagonal backpropagated metric gradient has the 
same complexity as ordinary backpropagation up to a constant factor. The 
quasi-diagonal natural gradient and quasi-diagonal backpropagated metric 
gradient have not been described before, to the best of our knowledge. 

Backpropagation is the simple gradient descent over parameter space. 
Gradient descents follow the steepest direction of change in parameter space, 
and implicitly rely on a norm (or quadratic form, or metric) to define the 
steepest direction (see Eq. I36|) . For backpropagation this norm is the nu- 
merical change in the values of the parameters: backpropagation provides 
the direction of largest improvement for a maximal change in these numeri- 
cal values. Hence simple changes in parametrization influence the behavior 
of the algorithm. On the other hand, norms based on what the network 
does, rather than how it is represented as numbers, will lead to intrinsic 
algorithms. This is one of the ideas behind Amari's natural gradient. 

In Section[2]we build several invariant norms, by placing neural networks 
in the context of differential manifolds and Riemannian geometry. The gra- 
dient descent coming from an invariant norm (Riemannian metric) will itself 
be invariant. Moreover, any gradient descent using any norm has the prop- 
erty that small enough learning rates ensure performance improvement at 
each step. 

The resulting algorithms are all invariant under a number of transfor- 
mations, including affine reparametrization of the unit activities. Among 
the invariance properties enjoyed by the unitwise natural gradient and the 
backpropagated metric (but not their quasi-diagonal reductions) are linear 
recombinations of the input received by a given unit in the network, so that 
a unit receiving signals / and f + eg (as functions over the dataset) will learn 
an output correlated to g just as fast as a unit receiving signals / and g (on 
the input layer this can be accounted for by normalizing and de-correlating 
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the inputs, but this could occur at internal units as well). Thus gradients 
have a "best-fit" interpretation (Section 13.31) : at each unit they solve a 
least-square problem of matching input signals and desired backpropagated 
output, an interpretation proposed in [Kur94j . 

The quasi-diagonal reductions of these algorithms are based on the ob- 
servation that there is a distinction between weeights wn~ and Wjk coming 
to k from different units, but no intrinsic mathematical separation between 
weights and biases. Intuitively, given that unit k receives a signal WikCii from 
unit i, if we change to Wik + Swa-, the average signal to unit k will change 
by SwikCLi where a, is the average activation of i. Hence it might be a good 
idea to automatically add — Swu-ai to the bias of k, to compensate. The 
quasi-diagonal algorithms we present are more sophisticated versions of this, 
tuned for invariance and using weighted averages. The few added terms in 
the update sometimes greatly improve performance (Fig on page !44|) . 

None of these algorithms is second-order: they are based on (Rieman- 
nian) metrics evaluating the magnitude of the effect on the output of changes 
in a given direction, thus providing a suitable learning rate for each direc- 
tion. They emulate second-order effects in the same way the Gauss-Newton 
algorithm emulates the Newton methocQ. 

Even though the focus of this article is mainly the mathematics of neural 
network training, we tested experimentally the impact of using the various 
methods. We selected a very simple auto-encoding problem on which we 
expected any training method would perform well. A sparsely connected 
network with 5 layers of size 100, 30, 10, 30, and 100 was built, and 16 
random length-100 binary strings were fed to the input layer, with the target 
equal to the input. Ideally the network learns to encode each of the 16 
samples using 4 bits on the middle layer (thus with room to spare) and 
rewrites the output from this. The details are given in Section SI 

Even then, backpropagation performs poorly: after 10, 000 batch passes 
the average log-loss is about 36 bits per sample (out of 100) for sigmoid 
backpropagation, and about 25 bits per sample for tanh backpropagation. 
Note that 30 bits per sample would correspond to a method which learns 
only the parameters of the output layer and can reproduce the output if 
someone fills the last hidden layer with the correct 30 bits. 

For the same total computation time equivalent to 10, 000 batch back- 
propagations^], the quasi-diagonal algorithms have a log loss of about 1.5 to 
3.5 bits per sample, and both the unitwise natural gradient and the back- 
propagated metric gradient have a log loss of 0.3 to 1.5 bit per sample, thus 

Actually, in the framework of differential geometry, the Hessian is intrinsically defined 
only at local optima of the function, so one could say that in such a setting the Newton 
method approximates the Gauss-Newton algorithm rather than the other way around. 

2 as measured in CPU time on a personal computer, but this can depend a lot on 
implementation details 
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Figure 2: Auto-encoding using a 100-30-10-30-100 deep sparsely connected 
network. Comparison of backpropagation using sigmoid and tanh activation, 
and the four algorithms described in Section [TJ for a given computation time 
budget. 



essentially solving the problem. See Figure [21 

The impact of the few non-diagonal terms in the quasi-diagonal algo- 
rithms was tested by removing them. In this case the quasi-diagonal back- 
propagated metric gradient reduces to the diagonal Gauss-Newton method 
[LBOM961 Section 7.4]. Of course, this breaks the invariance properties, 
thus the impact is different for sigmoid or tanh implementations. The di- 
agonal Gauss-Newton method in sigmoid implementation was found to per- 
form more poorly, with a final log-loss of about 12 bits per sample (Figure [S] 
on page |4"4"|) , while in tanh implementation it comes somewhat close to the 
quasi-diagonal algorithms at about 3.5 bits per sample (presumably because 
in our problem, the activity of all units, not only input units, stay perfectly 
centered during training). Thus the quasi-diagonal backpropagated metric 
gradient can be seen as "the invariant way" to write the diagonal Gauss- 
Newton method, while performance of the latter is not at all invariant. 

We also compared the exact unitwise natural gradient obtained thanks 
to Proposition |2"4"1 to a variant of the natural gradient in which only the 
gradient terms 66 T corresponding to the target for each sample are added 
to the Fisher matrix ( |RMB07] and Section 12.51 below). The latter, when 
implemented unitwise, performs rather poorly on this auto-encoding task, 
with a log loss of about 25 to 28 bits per sample. The reason, discussed 
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in Section [U may be that quality of the one-sample approximation to the 
Fisher matrix strongly depends on output dimensionality. 

One lesson from the numerical experiments is that the regularization 
term eld added to the matrices, needed to prevent bad behavior upon in- 
version, formally breaks the invariance properties: individual trajectories in 
sigmoid or tanh implementations, initialized in the same way, start to differ 
after a dozen iterations. Still, overall performance is not affected and is the 
same in both implementations (Figured p. I42p . 

Though the quasi-diagonal methods perform well, the only methods to 
sometimes reach very small values of the loss function on this example (less 
than 0.1 bit per sample) are the unitwise natural gradient and the backprop- 
agated metric, which at each unit maintain a full matrix over the afferent 
parameters and thus achieve invariance under affine recombination of incom- 
ing signals. These two methods are relevant only when network connectivity 
is not too high. This highlights the interest of sparsely connected networks 
from a theoretical viewpoint. 

Acknowledgments. I would like to thank Youhei Akimoto, Jeremy Ben- 
sadon, Cecile Germain, Balazs Kegl and Michele Sebag for helpful conversa- 
tions and their feedback on this article. 

Notation for neural networks 

Consider a directed neural network model: a set C of units together with 
a set of directed edges i — > j for i,j G £, without cycle. Let £ out be the 
output units, that is, the units with no outgoing edges, and similarly let £i n 
be the set of units without incoming edges. We use the logisic activation 
function: given an activation level for the input units, each unit j gets an 
activation level 

aj = sigm^ -^a;) = J = — (1) 

depending on the activation levels of the units i pointing to j and on the 
firing coefficients iPy frorr|^| itoj. Biases are treated as the weights u>oj from 
a special always-activated unit (do = 1). For a given non-input unit j, we 
call the parameters woj and Wij for i — ► j the set of afferent parameters to 
unit j. We refer to [RN03] . which we mostly follow with minor changes in 
notation. 

The dataset for this network is a set T> of inputs, where each input 
x G M^ in is a real-valued vector over the input layer. For each input is given 
a target y in an arbitrary space. We view the network as a probabilistic 

3 What is Wij for some authors is Wji for others. Our convention is the same as |RN03) 
but for instance LBOM96 follows the opposite convention. 
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generative model: given an input a% = Xi on the input layer £j n , we assume 
that the activations of the output layer are interpreted in a fixed way as a 
probability distribution co(y) over the target space. The goal is to maximize 
the probability to output y on input x: we define the loss function 

£(y) := -]nu(y) (2) 

the sum of which over the dataset is to be minimized. For instance, in- 
terpreting the output layer activities (afc)fce£ ou t as Gaussian variables with 
mean a k and variance 1 leads to a quadratic loss function I. 

We will use three possible interpretations of the activities of the output 
layer as probability distributions. 

Definition 1 (Bernoulli, square-loss, and classification 
interpretations) . 

The Bernoulli interpretation of the output layer is a Bernoulli distribution 
as follows: given the activations (a k ) k ^c out of the output layer, the final 
output is a {0, l} Cont -valued random variable Y = (Yfc)fce£ ou t of independent 
Bernoulli variables, where the activations a k give the probability to have 
Y k = 1, namely Pr(Y k = 1) = a k . 

The square-loss interpretation of the output layer sends the activations 
( a k)keC ut °f the output layer to a random variable Y = (Y k ) k ^c ou t of in- 
dependent Gaussian variables, where each Y k ~ Af(a k ,l) is a Gaussian of 
mean a k . 

The classification interpretation of the output layer sends the output 
activations (afc)fce£ ou t to a probability distribution over the set indexing the 
output layer, where the probability of class k G C out is a\/(^ k , &Cout a\,). 

(Our choice for the classification interpretation is motivated by two argu- 
ments, theoretical and practical. First, the set of probability distributions 
over a finite set, equipped with its Fisher metric, is isometric to the positive 
quadrant in a sphere and so is naturally parametrized by numbers a k with 
XIoj = 1, and these variables yield a simple expression for the Fisher matrix. 
Second, taking squares gives a boost to the most activated output unit, in a 
smooth way — in a deterministic context one would define the most activated 
output unit as the class of the sample) . 

A common way to train the network on a given target value y is by back- 
propagation. Define the backpropagated values hi (for a given loss function) 
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at each unit i by induction from the output layer: 

hi := ~~ J!r ^ or * ^ n ^ ne ou tput layer £ out 

^) (Bernoulli) 

yi — ai (square-loss) 

M - 2a » ^ 2 (classification) 

A : = Ej i-^j WijO-ji^ - aj)bj for i £ out 

(3) 

after having, of course, computed the activation levels cij by forward prop- 
agation. Then 6j indicates how the loss will react to a given change in the 
activation of unit i: 

<9ctj 

for all units i (not only the output layer) , so that bi indicates how we should 
modify the activities to decrease t. Moreover we have 

= -a iaj (l - aj )bj (5) 



dwij 



for each edge (ij) in the network. (This includes the bias woj using ao = 1.) 
So backpropagation computes the gradient of the loss with respect to the 
parameters. 

It is sometimes more convenient to work with the reduced variables b\ := 
aj(l — dijbi which satisfy 

(bi = Vi-ai for i in the output layer £ out 

\bi = ai(l - ai) Wijbj for i £ ont 

and then one has 

di 

= (7) 

The gradient descent with learning rate rj > is then defined as the 
following update on the firing coefficients: 

di 

Wij <- Wij - Vq^- = w ij + V a>iaj(l - ctj)bj = ?%• + rj aibj (8) 

1 Four invariant gradient algorithms 

We now describe four gradient algorithms for network training: the unit- 
wise natural gradient, the quasi- diagonal natural gradient, the backpropa- 
gated metric gradient, and the quasi- diagonal backpropagated metric gradi- 
ent. Each of these algorithms is adapted to a different scalability constraint. 
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The unitwise natural gradient requires low connectivity and a small out- 
put layer; the quasi-diagonal natural gradient requires a small output layer; 
the backpropagated metric gradient requires low connectivity; the quasi- 
diagonal backpropagated metric gradient has the same asymptotic complex- 
ity as backpropagation. 

These algorithms are the implementation, using sigmoid activation, of 
the more general versions described in Section [2J As they are designed 
for invariance properties, implementing them using tanh activation would 
result in the same output, learning trajectories, and performance, so we 
simply present one implementation. 

We first present the algorithms in a batch version. It is straightforward 
to adapt them to use random mini-batches from the dataset. In Section Tl .31 
they are adapted to an online setting: this can be done using standard 
techniques because the main quantities involved take the form of averages 
over the dataset, which can be updated online. 

1.1 Unitwise natural gradient and quasi-diagonal natural gra- 
dient 

The unitwise natural gradient has been proposed as far back as [Kur94] 
to train neural networks; however the presentation in [Kur94] is limited to 
networks with only one hidden layer, because it relies on an explicit symbolic 
computation of entries of the Fisher matrix. Proposition [2H below allows 
for an efficient computation of the exact Fisher information matrix by doing 
n out distinct backpropagations for each sample in the dataset. This relies 
on linearity of backpropagation, as follows. 

Definition 2 (Backpropagation transfer rates). 

Fix an input x for the network and compute the activities by forward prop- 
agation. Let k be a unit in the network and k out be a unit in the output 
layer. Define the backpropagation transfer rates J^ out from k on % to k by 
backpropagating the value 1 at fe ou t- Formally: 

j£°ut := i ; j£°ut := o, for k / A: out in the output layer C out 

jfcout ._ k _^. WkjO,j{l — dj)Jj° ut for non-output units k 

' ' ' ' (9) 

These transfer rates have the property that if backpropagation values b 
are set on the output layer, then = 2~Zfc out e£ out ^™'^out f° r an y nm ^ & 
(see also [LBOM961 Section 7.2]). 

Computation of the transfer rates can be done by n ou t backpropagations. 
There are further simplifications, since the transfer rates for k in the input 
layer are never used (as there are no afferent parameters), and the transfer 
rates on the last hidden layer are readily computed as J^° ut = Wkk out a k ut (1 — 
afc out ). Thus it is enough to backpropagate the transfer rates from the last 
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hidden layer to the first hidden layer (and so with only one hidden layer, 
the case considered in |Kur94j for the Fisher matrix, no backpropagation is 
needed). 

Definition 3 (Fisher modulus). 

Fix an input x for the network and compute the activities by forward prop- 
agation. For each unit k in the network, define the Fisher modulus &k{ x ) 
of unit k on input x as follows, depending on output layer interpretation. 

• For the Bernoulli interpretation, set 

( rfcout\2 

**(*):= E n ' i U°) 

• For the square-loss interpretation, set 

**(*):= E ( J k° ut ) 2 (11) 

^out C^out 



For the classification interpretation, set 

**(*):=! E (^° ut ) 2 -^( E | (12) 

"-out out \ feout £^out 

where S := Y\ k fFZ > a? . 
Definition 4 (Unitwise Fisher matrix). 

Let k be a unit in the network. Let E}. be the set of units afferent to k 
(including the always-activated unit 0). The unitwise Fisher matrix at unit 
k is the (#E k ) x (#E k ) matrix F<® defined by 

F t f := E xeT > OiOj-oKl - a k ) 2 <5> k (13) 

for i and j in E k (including the unit with ao = 1). Here ^ x ev represents 
the average over samples x in the dntHset (nil the terms cii, cij, cl^, $fc depend 
on the input to the network). 

By Proposition [24] below, F^ is the block of the Fisher information 
matrix associated with the parameters afferent to k, hence the name. 

Definition 5 (Unitwise natural gradient). 

The unitwise natural gradient with learning rate r] > updates the param- 
eters of the network as follows. 

For each unit k, define the vector by 

G[ k) := -E x&v ^^ = E xe v OittfcCl - a k )b k (14) 

OWik 
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where b k is the backpropagated value at k obtained from the target y asso- 
ciated with x, and where i runs over the units afferent to k (including the 
always-activated unit i = 0). Compute the vector 

fob) := (fW)- 1 ^ (15) 

Then the parameters of the network are updated by 

Wik *- Wik + rjSwf 1 ' (16) 

The unitwise natural gradient requires updating a matrix with (#-Efc) 2 
coefficients for each data sample (and inverting it, but this is done only once 
per pass over the dataset). This is fine if connectivity is low. We now define 
a more light-weight version in case connectivity is large. Its computational 
cost is equivalent to that of ordinary backpropagation provided the output 
layer is small. 

Definition 6 (Quasi-diagonal natural gradient). 

The quasi-diagonal natural gradient with learning rate rj > updates the 
parameters of the network as follows. 

For each unit k, compute only the entries Fqq , f j , aiid F^f 1 of the 
unitwise Fisher matrix at k. Define the vector as in (|14p above. Define 
the vector Sw^ by 

, . r (k) F (k) _ r (k) F (k) 

faf ) = G i f oo G o for ^ (17) 

* T?( k ) rp{ k ) _ (Tp( k >\2 

r ii r 0Q \ r 0i I 

and update the parameters of the network by 

Wik <- Wik + ridwf^ (19) 

These latter formulas may seem arbitrary. If we remember that (omitting 
the superscript (k)) F 0Q = E x£ n a? k (l - a k ) 2 ® k , F Qi = E x£T > e^af (1 - a fc ) 2 $ fe , 
and Fa = K x£ x> a f a \{^~ a k) 2 ^ki we can consider these sums as expectations 
over the dataset with weights af.(l — ak) 2 ^k- Then the weighted average of 
<2j is Ai = Fqi I Foo and its weighted variance is V% = Fa / Fqq — Af so that we 
have 

and in particular the denominator is always positive unless the activity of 
unit i is constant (in this case, the numerator vanishes too). 

A possible interpretation is as follows: If the activity ai of i is centered 
over the dataset (with the weights above), then the update is diagonal and 
the activities are scaled by 1/V». If a% is not centered, when we change a 
corresponding term is automatically subtracted from the bias WQk so as not 
to shift the average activity of unit k, as discussed in the intro. 
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1.2 Backpropagated metric gradient and quasi-diagonal back- 
propagated metric gradient 

Computing the Fisher matrix as above requires performing n out backprop- 
agations for each sample. If one tries to compute the Fisher modulus 
directly by backpropagation, the backpropagation equation involves cross- 
terms between different units. Neglecting these cross-terms results in a sim- 
pler version of the Fisher modulus which can be computed in one backward 
pass; the corresponding backpropagation equation is well-known as an ap- 
proximation of the Hessian [LBOM96, Section 7]. It turns out this quantity 
and the associated metric are still intrinsic. 

Definition 7 (Backpropagated modulus). 

Fix an input x for the network and compute the activities by forward prop- 
agation. Define the backpropagated modulus m k (x) for each unit k by 

m k (x) := w lj a )( l ~ a j ) 2 m j (x) (21) 

if k is not an output unit, and, depending on output interpretation, 

(^1) (Bernoulli) 
m k( x ) '■= \ 1 (square-loss) (22) 

U C 1 - f ) > S = £ fcout 6£ out °iL (classification) 
for k in the output layer. 

Definition 8 (Backpropagated metric). 

Let k be a unit in the network. Let E k be the set of units afferent to k 
(including the always-activated unit 0). The backpropagated metric at unit 
k is the (#E k ) x (#E k ) matrix defined by 

:= E x( zt, aidjalil - a k ) 2 m k (23) 

for i and j in E k (including the unit with clq = I). Here E x£ x> represents 
the average over samples x in the dataset (all the terms a^, dj, a k , m k depend 
on the input to the network). 

The backpropagated metric gradient can thus be described as an approx- 
imate, blockwise Hessian method in which the Hessian is approximated by 
the Gauss-Newton technique with, in addition, cross-unit terms neglected. 
Such a method turns out to be intrinsic. 

Definition 9 (Backpropagated metric gradient). 

The backpropagated metric gradient with learning rate r\ > updates the 
parameters of the network as follows. 
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For each unit k, define the vector by 



G\ := -E X£V -^ = E xev OiOfcCl - a fc )6 fc (24) 



where is the backpropagated value at k obtained from the target y asso- 
ciated with x, and where i runs over the units afferent to k (including the 
always-activated unit i = 0). Compute the vector 

5 w ( k ) := (mW)" 1 ^' (25) 

Then the parameters of the network are updated by 

Wik <- Wik + rjSwf (26) 

Definition 10 (Quasi-diagonal backpropagated metric gra- 
dient). 

The quasi-diagonal backpropagated metric gradient with learning rate r) > 
updates the parameters of the network as follows. 

For each unit k, compute only the entries Mqq\ and M-^ of the 

unitwise Fisher matrix at k. Dehne the vector as in (|14|) above. Define 
the vector 5w^ by 

5w f) = gi_fQ0_gg m m 

and update the parameters of the network by 

Wik *- Wik + rjSwf 1 ' (29) 
The same remarks as for the quasi-diagonal natural gradient apply for 

(k) (k) / (k)\ r > 

interpreting the various terms. The denominator Mq Q — (M^ ) can be 
seen as a weighted variance of the activity of unit i, and is positive unless a, 
is constant over the dataset. The contribution of 5w^ to 5w^ compensates 
the change of average activity induced by a change of Wik ■ 

The asymptotic cost of this update is the same as for backpropagation. 

If, in the quasi-diagonal backpropagated metric gradient, the non-diagonal 
terms are neglected (M^ is set to 0), then this reduces to the diagonal 
Gauss-Newton method equations from [LBOM96, Section 7.4]. 

Remark 11. 

On the parameters afferent to the output layer, the unitwise natural gradient 
and the backpropagated metric gradient coincide if the Bernoulli or square- 
loss interpretation is used. (Actually, with learning rate r\ = 1 they also both 
coincide with the Newton method restricted to the output layer parameters.) 
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Remark 12. 

Since these algorithms rely on inverting matrices, regularization is an issue. 
In practice, terms e Id have to be added to F and M before inversion; terms 
e have to be added to the diagonal terms F^\ F^\ and in the 

quasi-diagonal reduction. This formally breaks the invariance properties; 
Section [3.31 elaborates on this. Still, this operation preserves the guarantee 
of improvement for small enough learning rates. 

1.3 Adaptation to an online setting 

The unitwise natural gradient and unitwise backpropagated metric gradient 
both update the weights by 

5w = A- l G (30) 

with G the gradient of the loss function over the dataset, and A a positive- 
definite, symmetric matrix. A key feature here is that the matrix A takes 
the form of an expectation over the dataset: F^' = ~E x ev Oi0 3 -o|(l — ak) 2 &k 

for the Fisher matrix, and M^- = E^g-p aiCLja\(l — a^m^ for the backprop- 
agated metric. 

Any such algorithm can be turned online using a standard construction 
as follows (compare e.g. [RMB07J). Another possibility is, of course, to use 
mini-batches. 

In the following, A stands for either the unitwise Fisher matrix or the 
backpropagated metric. Let A{x) be the corresponding contribution of each 
input x in the expectation, namely, A(x)\j = aiaja\{l — a^) 2 ^ for the 

Fisher metric and A(x)\j = aiaja\(l — ak) 2 mk for the backpropagated 
metric, so that A = E x &v A(x). 

At each step t, we use one new sample in the dataset, update an estimate 
A® of A, and follow a gradient step for this sample, as follows. 

• Initialize the matrix A^ by using a small subsample "Dimt C T>, for 
instance the first ni n it samples in the dataset. Namely: 

A<°> :=E xe v init A(x) (31) 

• Fix a discount factor < 7 < 1. For each new sample xt, compute its 
contribution A(xt) and update A by 

AW := (l- 7 )^- 1 )+ 7 A(x t ) (32) 

• Compute the inverse of A^> from the inverse of A^ -1 ) by the Sherman- 
Morrison formula at each unit, using the fact that A(xt) is a rank-one 
matrix at each unit. (This way matrix inversion is no more costly than 
the rest of the step.) 
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• Compute the negative gradient G(xt) of the loss function on input xt 
by backpropagation. 

• Update the parameters by 

w^w + rjtiA^^Gixt) (33) 
where rjt is the learning rate. 

• For the quasi-diagonal reductions of the algorithms, only the entries 
^4oO) An and A$i of the matrix A are updated at each step. No matrix 
inversion is required for the update equations (fT7|) - ([T8|) and (f!IT|) - (f2g|) . 

We could also initialize A^ to a simple matrix like Id, but this breaks 
the invariance properties of the algorithms. 

The update rule for A® depends on the discount factor 7. It should 
be large enough so that a large number of data points contribute to the 
computation of A, but small enough to be reactive so that A evolves as 
training gets along. In our setting, from the particular form of A(x) at a 
unit k we see that each A{xt) contributes a rank-one matrix. This means 
that 7 should be much smaller than l/n/% with the number of parameters 
at unit k, because otherwise the estimated matrix A® will be close to a 
low-rank matrix, and presumably a poor approximation of the true matrix 
A, unreliable for numerical inversion. 

The same remark applies to the number n- m ^ of samples used for initial- 
ization: it should be somewhat larger than the number of parameters at 
each unit, otherwise A^ will be of low rank. 

2 Constructing invariant algorithms: Riemannian 
metrics for neural networks 

2.1 Gradient descents and metrics, natural metrics 

The gradient of a function / on W 1 gives the direction of steepest ascent: 
among all (very small) vectors with a given norm, it provides the greatest 
variation of /. Formally, the gradient V/ of a smooth function / is defined 
by the property that 

f(x + ev) = f(x) + e(Vf,v)+0(e 2 ) (34) 

for any vector v, for small enough e. This depends on the choice of a scalar 
product (•,•). In an orthonormal basis, the coordinates of the gradient are 
simply the partial derivatives df / dxi so that gradient descent is 

Xi <- Xi - rjdf/dxi (35) 
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in an orthonormal basis. 

For a given norm of the vector v, the quantity (V/, v) is maximal when v 
is collinear with V/: so the gradient V/ indeed gives the direction of steepest 
ascent among all vectors with a given norm. The gradient step x 4— x — r/ V/ 
can actually be rewritten (for small enough rj, up to 0(t] 2 ) and for regular 
enough functions /) as 

x <- argmin + \\y - x\\ 2 ^ (36) 

namely, the gradient descent moves into the direction yielding the smallest 
values of /, penalized by the distance from the current point|j. This allows 
to see how the choice of the scalar product will influence the direction of 
the gradient V/: indeed, another scalar product will define another norm 
\\v \\ 2 = (v, v) for the vector v, so that the steepest direction among all vectors 
with a given norm will not be the same. The norm thus defines directions v 
in which it is "cheap" or "expensive" to move; the gradient is the direction 
of steepest ascent taking this into account. 

If we happen to work in a non-orthonormal basis of vectors vi, ■ ■ ■ , v^, 
the gradient is given by A~ 1 df/dx where df/dx is the vector of partial 
derivatives with respect to the coordinates X{ in the basis, and A is the 
symmetric matrix made of the scalar products of the basis vectors with 
themselves: Aij := {vi \ Vj). Thus gradient descent of / takes the form 

x<- x-r)A- l df/dx (37) 

Conversely, we can start with a norm on M. d defined through a positive- 
definite, symmetric matrix A (which thus defines "cheap" and "expensive" 
directions). The gradient ascent in the metric A will then be given by (|37|) . 

So any update of the form above with A a symmetric, positive-definite 
matrix can be seen as the gradient descent of / using some metric. The 
metric A may even depend on the current point x, defining a Riemannian 
metric. 

An important feature of gradient descent, in any metric, is that for 77 
small enough, the step is guaranteed to decrease the value of /. 

The choice of a metric A represents the choice of a norm for vectors in 
parameter space. Conversely, choosing a set of parameters and using the 
"naive" gradient ascent for these parameters amounts to implicitly deciding 
that these parameters form an orthonormal basis. 

For the neural network above, the gradient ascent Wij <— Wij — f)-Mr: 
corresponds to the choice of A = Id on parameter space, namely, the norm 
of a change of parameters 5w = (dwij) is \\Sw\\ 2 := I &Uij\ 2 ■ Thus, gradient 
descent for Wij gives the best 6w for a given norm ||<5w||, that is, the best 
change of / for a given change in the numerical value of the parameters. 



4 This can be used to define or study the direction of the gradient in more general 
metric spaces (e.g., [AGS05I Chapter 2]). 



16 



Example: from sigmoid to tanh activation function. Neural net- 
works using the sigmoid and tanh activation function are defined, respec- 
tively, by 

a k = sigm( a i w ik) (38) 
i, i—tk 

and 

a' k = tanh( ^ a' iW ' ik ) (39) 

i, i—^k 

(including the biases WQ k and w' ok ). Since tanh(x) = 2sigm(2x) — 1, the 
activitives of the network correspond to each other via a' k = 2a k — 1 for all 
k if we set 

w ik = -7- (40) 



for i ^ 0, and 

2 4 



% = — + iL ^ ( 41 ) 



for the biases. 

Consequently, while the gradient for the sigmoid function will try to 
improve performance while minimizing the change to the numerical values 
of Wi k an d WQk) the gradient for the tanh function will do the same for the 
numerical values of w' ik and w' ok , obviously resulting in different updates. If 
we follow the tanh gradient and rewrite it back in terms of the variables Wi k , 
we see that the tanh update expressed in the variables w,i k is 

w ik w ik + 16(Sw ik - -Sw 0k ) (i^O) (42) 



and 



^ofc <- wok + 4(5u> fc - 8^2(5w ik - ]rSw k) (43) 



where Swi k is the update that would have been applied to w ik if we were 
following the standard sigmoid backpropagation. Indeed this takes the form 
of a symmetric matrix applied to 5wi k (the cross-contributions of SwQ k to 
Wi k and of 8wi k to WQk are the same). 

Apart from an obvious speedup factor, an important difference between 
this update and ordinary (sigmoid) backpropagation on the Wi k is that each 
time a weight Wik is updated, there is an opposite, twice as small contribution 
to WQ k : in this sense, it is as if this update assumes that the activities aj are 
centered around 1/2 so that when Wi k gets changed to Wik + c, one "needs" 
to add — c/2 to the bias so that things stay the same on average. 
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Newton's method and gradient descent. To find the minimum of a 
function / on R, one can use the Newton method to solve /' = 0, namely, 
x <— x — f'{x) //"(»). In higher dimension this becomes 



where df/dx is the vector of partial derivatives, and (Hess := d 2 f /dxidxj 
is the Hessian matrix of /. 

Around a non-degenerate minimum of /, the Hessian Hess / will be a 
positive-definite matrix. So the Newton method can be seen as a gradient 
descent with learning rate rj = 1, in the metric A = Hess/, when one is 
close enough to a minimum. 

Intrinsic metrics. There could be a lot of arguing and counter-arguing 
about the "right" way to write the parameters with respect to which the 
gradient should be taken. The solution to avoid these choices is known: 
use metrics that depend on what the system does, rather than on how the 
parameters are decomposed as numbers. 

The Fisher metric, which defines a natural gradient |AN00| . is one such 
metric. Namely: the size (norm) of a change of parameters is measured by 
the change it induces on the probability distribution of the output of the 
model. The symmetric matrix A used in the gradient update is then the 
Fisher information matrix. We will use scaled-down versions of the Fisher 
metric for better scalability. 

We present another metric for neural networks, the backpropagated met- 
ric. The size of a change of parameters at a given unit is measured by the 
effect it has on the units it directly influences, which is itself measured re- 
cursively in the same way up to the output layer. The matrix defining this 
metric is obtained by well-known equations related to the Gauss-Newton 
approximation of the Hessian. 

2.2 Intrinsic metrics and their computation by backpropaga- 



Here we rewrite the definition of neural networks in the language of differ- 
ential manifolds and Riemannian geometry; this allows to define metrics 
directly in an intrinsic way. 

Consider a neural-like network made of units influencing each other. The 
activity of each unit k takes values in a space Ak which we assume to be 
a differentiate manifold (typically R without a preferred origin and scale, 
but we allow room for multidimensional activations). Suppose that the 
activation of the network follows 



x x — (Hess/) df/dx 



(44) 



tion 




(45) 
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where a^, . . . ,di nk are the units pointing to k, and where ft is a function 
from Ai 1 x • • • x Ai n to Ak, depending on a parameter 9^ which itself belongs 
to a manifold Here we have no special, always-activated unit coding for 
biases: the biases are a part of the parameters 8\.. 

We shall also assume that the output units in the network are interpreted 
through a final decoding function to produce an object uj = oj((ak)kec ut) 
relevant to the initial problem, also assumed to belong to a differentiable 
manifold. 

To implement any gradient ascent over the parameters 9, we first need a 
(Riemannian) metric on the parameter space. Such a metric can be defined 
by choosing a parametrization by R rf and deciding that the elementary vec- 
tors of M. d are orthogonal, but this is not intrinsic: different parametrizations 
will lead to different learning trajectories. 

In this setting, an object is said to be intrinsic if it does not depend 
on a choice of parametrization of any of the manifolds involved (activities, 
parameters, final output). 

We assume that we are given a meaningful Riemannian metric on the 
final output cu: that is, we know how to measure the size of a change in the 
output. For instance, if oj describes a probability distribution over a target 
variable y, we can use the Fisher metric over oj. 

Then there are several possibilities to define intrinsic Riemannian metrics 
on parameter space. The most direct one is the Fisher metric: the output is 
seen as a function of all parameters, and the norm of a change of parameter 
69 (over all parameters at once) is the norm of the change it induces on the 
output. This is not scalable. 

A more scalable version is to break down the change of parameter into 
a sum of changes of parameters afferent to each unit and take the Fisher 
metric at each unit independently. This is the unitwise Fisher metric. As 
we will see, it scales well to sparsely connected networks if the output layer 
is not too large. 

An even simpler version is the backpropagated metric, defined by back- 
wards induction from the output layer: the norm of a change of parameter 
on the output layer is the norm of the change it induces on the final result, 
and the norm of a change of parameter at an internal unit is the sum of the 
norm of the resulting changes at the units it influences directly. 

In what follows, we use the standard objects of differential geometry but 
try to present them in an intuitive way. The notation 5a, 59, 5oj denotes 
tangent vectors on the corresponding manifolds (intuitively, differences be- 
tween two very close values of a or 8 or oj). The notation J^- denotes the 
differential of the activity dj seen as a function of a& (intuitively, this is just 
the Jacobian matrix of partial derivatives). The various metrics involved 
are (0, 2)-tensors, but we use standard matrix notation for them. 

Definition 13 (Natural metric, unitwise natural metric, 
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BACKPROPAGATED METRIC). 

Let \\ouj\\ = lijdojidujj = our ISoj be a metric on the final output of the 
network, given by the symmetric, positive-definite matrix I. We define three 
metrics on the parameter set. 

• The natural metric on the parameter set 9 = {9k) is dehned as follows. 
Let x be an input in the dataset V and let uj{x) be the hnal output of 
the network run with input x and parameter 9. Let 59 be a variation 
of 9 and let 5ui{x) be the resulting variation of oj(x). Let 

ll<W|lLt,x : =IIM^)l| 2 (46) 
and then define the natural metric by 

\\Se\\l t :=E xeV \\6e\\l tjX (47) 

In matrix form, we have 5uj(x) = 59 where ^ is the Jacobian 
matrix of u(x) as a function of 9, so that the natural metric is given 
by the matrix 



\m\Lt = Ex-d«' -&r I ^r 69 ( 48 ) 

The natural metric is given by a matrix of size dim 9 = J2k dim 9k- 
The unitwise natural metric on the parameter set 9 is 

ll^llu-nat : =Ell^Hnat (49) 
k 

where k runs over the units of the network and 59k is the variation of 
the parameters afferent to unit k. This metric is given by keeping only 
the block-diagonal terms afferent to each unit in the matrix dehning 
the natural metric. 

In case uj is a probability distribution and the metric I on u is the 
Fisher metric, we also call ||o~0|| nat and ||<5#|| u _ nat the Fisher metric and 
unitwise Fisher metric. 

The backpropagated metric on 9 is dehned as follows. Let x be an 
input in the data. We hrst dehne a metric on each of the activities a^, 
depending on the input x, working from the output layer backwards. 

Given a change 5ak out in the activity at an output unit k out , let 5oj(x) 
be the corresponding change in the hnal output and set 

H<^ouJbp,* : =IIM*)ll 2 (50) 
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The metric on internal units k is defined as follows: Given a change 
Jafc in the activity of unit k, let 5ai be the resulting changes in the 
activities of units k — > i directly influenced by k. Define by induction 
from the output layer 

IMIb P ,, : = E UMbp,, (51) 

i, k^i 

Given a change 56k of the parameters afferent to unit k, let 5at be the 
resulting change of activity of unit k on input x and let 

\\S0k\\l P , x ■■= \\Sa k \\ 2 hp , x (52) 
and, finally, define the backpropagated metric by 

II^II^^E^H^II^ (53) 

and 

ll^llbp^EH^IIbp (54) 

For the Fisher metric, the following is well-known. 
Proposition 14. 

The natural metric, unitwise natural metric, and backpropagated metric 
are intrinsic: \\56\\ nat , ||<5#|| u _ nat , and \\56\\ hp do not depend on a choice of 
parametrization for the activations a& and for the parameter 6j~ at each unit 
k. 

Proof. These metrics have been defined without any reference to parametriza- 
tions. □ 

The natural metric actually has stronger invariance properties than the 
unitwise natural metric: it does not depend on a change of parametrization 
of the whole parameter 6 = (#&) that would mix the various components. 
As such, the unitwise natural metric depends on a choice of decomposition 
of the network into units, while the natural metric is only a function of the 
input-output relationship of the whole network. 

Remark 15 (Change in activation profile). 

We saw above that the metric used to define a gradient represents a "cost" of 
moving in certain directions. Here, the backpropagated metric on a variation 
of the parameters at k is || S6k\\ bp = E^er* ||5afc(a;)|| b where 5cik{x) is the 
resulting change of activity on input x, weighted by its backpropagated 
influence over the final output. Thus, the "cost" of a change at unit k 
according to this metric, is the average square norm of the resulting change 
in activation profile a,k(x) over x in the dataset; moreover each unit k is 
weighted by its estimated influence on the output. 
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To compute the natural and unitwise natural metrics, it is enough to 
compute the Jacobian matrix This can be done by performing one 
backpropagation for each component of the output layer, for each input 
x € V, as follows. 

Definition 16 (Backpropagation transfer rates). 

Let fc out be an output unit and let k be any unit in the network. The 
backpropagation transfer rate J k ° ut from k out to k is the dim(a/j out ) x dim(afc) 
matrix defined by 

jkout . TJ 

J k out — iQ dim(a fcout ) 

< J^° ut : = for k^ k out in the output layer £ out (55) 

^ j^out := fc _^ jfcout |£i for non . output units k 

where is the Jacobian matrix of the activation function from unit k to 
unit j. Then we have Jt out = d( t k ° ut . 

This depends on an input x: the activation state of the network has to be 
computed by forward propagation before these quantities can be computed. 

Typically the activities are one-dimensional, not multidimensional, so 
that each J^ out is just a number, not a matrix. In this case, all the transfer 
rates J k ° ut can be computed by performing n out distinct backpropagations 
each initialized with a single 1 on the output layer. 

Since the influence of the parameter 9k on the output channels through 
the activity of unit k, the unitwise natural metric at k can be computed from 
a single number (if activities are one-dimensional) measuring the influence 
of unit k on the output, the Fisher modulus. 

Definition 17 (Fisher modulus). 

Let x be an input. Let k be a unit in the network. Let I be the metric 
on the final output uj. The Fisher modulus $fc(x) of k on input x is the 
dim(afc) x dim(afc) matrix given by 

\feoute-Cout feout / \fcoute-Cout feout / 

For each input x, the Fisher modulus is an intrinsic metric on a^: for a given 
input x, the norm 

\\ Sa k\\l-mod '■= 5a T k <S> k 5a k (57) 
does not depend on any choice of parametrization. 

Note that d ® w depends on the output layer interpretation but not on 

^out 

any parameter 9. Thus, since the transfer rates J can be computed by 
backpropagation, the Fisher modulus only involves known quantities. 
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Proposition 18 (Computation of the unitwise natural met- 
ric). 

The unitwise natural metric at unit k is given by 

ll^llu-nat=E^ ll<Mx)|| F _ mod (58) 

= ^ v5el W k ® k W k 6Bk (59) 

where 5a k (x) is the variation of af-(x) induced by 59, and is the Jacobian 
matrix of the activation function at k. Thus the matrix dehning the unitwise 
natural metric at unit k is 

Proof. By definition of the transfer rates J we have J k ° ut = Thus 

<E> ^ UjJ I (61) 
8a k 8a k 

dak T dau da^ du dco dak du T T du , . 

— - $ fc — - = — - / - = / (62) 

89k d6 k 89k 8a k 8a k 89 k 89 k 89 k 

which, after averaging over the dataset, is the definition of the unitwise 
natural metric at A;. □ 

An analogous formula can be defined for the full (rather than unitwise) 
Fisher matrix, by defining a Fisher modulus Q kk > indexed by two units, and 
using unit k! on the left and k on the right in (|56|) . Then the block of entries 
of the Fisher matrix corresponding to parameters 9 k and 9 k > is 

8a k , T 8a k 

^ **v ^ (63) 

(see also Proposition I24|) . 

The unitwise Fisher metric is costly to compute when the output layer 
is large. We can define another intrinsic metric for the activation of unit k, 
simply by backpropagating the metric of the output layer. 

The changes in the output induced by a change of 9 k all transit through 

1 1 2 

the activation of unit k. So if we have an intrinsic metric ||<5a/%|| for the 
activation of unit k, we can immediately define an intrinsic metric for 9 k , by 



hence 



dak 

defining the norm of 59 k to be the norm of the resulting 5a k . If the metric 



looking at the resulting change 5a k = -^;59 k induced by a change 59 k , and 
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on a k is given, in some parametrization of a k , by ||&i&|| 2 = 5oJ k g k 5a k where 
g k is a symmetric, positive-definite matrix of size (dima^) x (dima^), then 
defining \\S9 k \\ to be the norm of this Sa k , 



yields 



(64) 



aorv gk \oW k 6dk ) (65) 

in other words, the matrix defining this metric is 9ki%j^- 

The unitwise Fisher metric is obtained from the Fisher modulus by this 
construction. We now define another intrinsic modulus playing the same 
role for the backpropagated metric. 

Proposition 19 (Backpropagated modulus and computa- 
tion OF THE BACKPROPAGATED METRIC). 

Let x be an input. Let k be a unit in the network. Let I be the met- 
ric on the final output uj. The backpropagated modulus m k {x) at k is the 
dim(afc) x dim(afc) matrix given by 

[ ir^ I #^ for k in the output layer 

m k {x) := I da " 9a ^ T (66) 
[ 2j, ?w m j gff fo J ^ an internal unit 

Then, for each input x, the backpropagated metric on a k is given by the 
backpropagated modulus: 

\\ Sa k\\lp, x = Sa T k m k 5a k (67) 

and so the backpropagated metric on 6 k is given by the matrix E xe x> m k , 
namely, 

\\8e k \\ 2 hp = E xev Se T k ^m k ^59 k (68) 
Proof. Immediate from the definition of the backpropagated metric and 

Like the Fisher modulus, the backpropagated modulus is a single number 
when activities are one-dimensional. The cost of its computation is the same 
as one backpropagation pass. 

The equation defining the backpropagated modulus is well-known: it is 
related to the so-called Gauss-Newton approximation to the Newton method 
(see for instance [LBOM96], Section 7), which consists in computing the 
Hessian of the loss function and throwing away all terms involving the sec- 
ond derivative of the activation function (those could result in non-positive- 
definite terms, in which case the Newton method is ill-behaved), with the 
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additional approximation that cross-terms between different units are also 
thrown away. Here we see that no approximation is involved: we do not 
throw away annoying terms, we simply define an intrinsic metric. There 
is actually no meaningful notion of the Hessian of a function on manifolds 
[GHL87, paragraph 3.37] unless additional structure (affine, Riemannian) is 
given or we are at a critical point of the function; the annoying terms above 
are precisely the terms that prevent such a notion to exist. So one could 
even say, in the context of differential geometry, that the Newton method is 
an approximation of the backpropagated metric rather than the other way 
round. 

The backpropagated modulus and the Fisher modulus are related: If one 
tries to write a backpropagated equation to compute the Fisher modulus &k 
in terms of the Fisher modulus at units pointed by k, one finds a quadratic 
(instead of linear) backpropagation equation with terms involving pairs of 
units. Keeping only the terms involving a single unit yields the equation 
defining the backpropagated modulus. 

2.3 Quasi-diagonal reduction of a metric 

The unitwise Fisher metric and the backpropagated metric still involve a 
full matrix on the parameter space at each unit, and are thus not adapted if 
network connectivity is large. We now introduce two metrics enjoying lesser 
invar iance properties than the above, but quicker to compute. 

Given an intrinsic metric \\60k\\ ° n (such as the unitwise Fisher or 
backpropagated metric), we are going to define a simpler one, ||$?fc|| qd - The 
inverse of the matrix defining this metric will be quasi- diagonal, with the 
only non-zero diagonal terms being those between a weight and the bias. 
This will allow for quick gradient steps costing no more than classical back- 
propagation. 

This relies on the affine structure in neural networks: this simplification 
makes sense in a somewhat more restricted setting than the general setting 
above. Suppose that the activation function 



can be written as a composition of a fixed, non-linear activation function ip 
and a quantity that is an affine function of a^ , . . . , ai n : 



such that when 6^ ranges over its values, yk,e k ranges over all possible affine 
functions of a^ , . . . , a^ . For this to make sense, we now have to assume 
that the activities a& live in an affine space. So let us go baclH to activities 

5 Technically, this would be best described with activities taking values in affine mani- 
folds. The "weights" are then the partial derivatives dyt/dai, which are well-defined linear 
maps from the tangent space of at to the tangent space of yt- 




(69) 




(70) 
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with values in R, but without any preferred origin and scale for activities: 
we look for invariance under replacement of a% with ajOj + 

In any given parametrization (choice of origin and basis) of a« and y k we 
can write 

Vk= Wik<Xi + ^ofc (71) 

i, i^>k 

for some values Wik] specifying the parameter 9 k is equivalent to specifying 
these quantities. 

But this decomposition will change if we change the affine parametriza- 
tion of activities: if a! i = ctiOi + Pi and y' k = jkUk + &k the relation becomes 
y'k = 5 k + Ej IkWikO^ 1 (a- - Pi) + -fkWok = E^Tfe^ifcO^K + {<y k wok + $k - 
J2i Wikd~ l Pi) so that the new weights are w' ik = jkWikCt' 1 and the new bias 
is w' ok = 7fcU>ofc + <5fc — Ei^ifcQ^ 1 Pi- In particular we see that there is no 
intrinsic "separation" between the bias and the weights; but that there is a 
separation between and w^k for different incoming units i and i! . This 
is formalized as follows. 

Let 59k be a change of parameter Oj, and let Swik be the resulting change 
of Wik, with i 7^ 0. If ^Wjfc = in one parametrization, then we have Sw' ik = 
as well in any other affine parametrization since w' ik = ^kWik ^ 1 - Say that 
a change of parameter 56k does not involve unit i if 5wik vanishes: this does 
not depend on the chosen affine parametrization of activities and is thus 
intrinsic. 

Moreover, having 5wik = does not depend on the input or current 
activation state of the network: this is where we use that y k is an affine 
function of the (oj). 

For the bias the situation is different: the expression w' 0k = 7fc^ofc + ^k~ 
^iWikCt^ 1 giving the bias in a parametrization from the bias in another 
parametrization is more complex, and so the fact that 5vjQk = does depend 
on the parametrization. This is where the metric \\59k\\ we are trying to 
simplify comes into play. 

Say that a change of parameter 59k is pure bias if it does not involve any 
unit i afferent to k, i.e., if 5wik = for all i ^ 0. Say that 59k is bias-free 
if it is orthogonal, in the metric ||<50fc|| we are trying to simplify, to all pure- 
bias vectors. Thus being bias-free does not simply mean 5wok = 0. Being 
bias-free is an intrinsic condition; let us work it out. 

Let An* be the symmetric matrix defining the metric \\59k\\ in a given 
parametrization. The associated scalar product is 

(59 k ,59 k ) = Y^ A u , 5w ik 5wt, k +Y A i(5w k5wi k +5w' ok 5w ik )+A 00 5w k5w' ok 

i V i 

(72) 

with A 0i = A i0 . 

In particular, if the only non-zero component of 59k is 5wik, then its 
scalar product with a pure bias 5w' ok will be Aoi5w' ok 5wik- On the other 
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hand, if to 50k we add a bias component 5u>ok = — Ax) -Aoi&Wik, then the 
scalar product with any pure bias will vanish. Such a 50k is thus bias-free. 

In the case when the parameter Ok allows to represent all affine func- 
tions of the incoming activations, we can decompose a variation 50k of 
Ok into components 50 ki each involving only one afferent unit i, and a 
pure bias component 50 ko- This decomposition is unique if we impose 
that each 50ki is bias-free. Explicitly, if in some parametrization we have 
50k = (dwok, 5w\k, ■ ■ ■ , 5w nk k) this decomposition is 

SOh = (-A^AoiSwik, 0, . . . , 5w ik , 0, . . . , 0) (73) 

and 

50 ko = {5w ok + J2 AwAoiSwik, 0, . . . , 0) (74) 

i 

The decomposition 50k = 50ko + J2i d@ki is intrinsic. 

We can then define a new intrinsic metric on 50k by setting 

ll^llqd : =ll^0|| 2 + Ell^H 2 ( 75 ) 

i 

which is readily computed: 

ll^fcllqd = ^oo (^5w k + A m A 0i$Wikj 

+ J2( A u^ w ik - 2goi( A oo A 0i 5wik)5wik + A 00 (Aqq A 0i 5wik) 2 ) 

i 

= A 00 5wlk + 2^2 A i5w k5wik + Y J A^AQ i A w 5w ik 5w Vk 

i 

+ J2( A u-Ao 1 A 2 0l )5wl 

(76) 

Thus, this metric is defined by a matrix A given by ^4oo = ^00) Aqi = A$i 
and An, = A^A 0i Aoi' + li=i>{Au - A^A 2 m ). 

Definition 20. 

Quasi-diagonal reduction is the process which, to an intrinsic metric defined 
by a matrix A in affine coordinates, associates the metric defined by the 
matrix 

A := diag(A) + A^(v ®v)- diag(A 00 1 (u <g> v)) (77) 

where 

v = (A 00 ,...,A 0i ,...) (78) 

The quasi-diagonal backpropagated metric is the quasi-diagonal metric 
obtained from the backpropagated metric. The quasi-diagonal Fisher metric 
is the one obtained from the unitwise Fisher metric. 
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The reasoning in this section can be summarized as follows. 



Proposition 21. 

Assume that the activation function is a fixed non-linear function composed 
with an afhne function. Then the quasi-diagonal reduction A of an intrinsic 
metric A is intrinsic. 

Importantly, the matrix A = diag(v4) + A^(v (8> v) — diag(ylQQ 1 (w (g> v)) 
is the sum of a diagonal matrix and a rank-1 matrix. This allows for easy 
inversion, resulting in a quasi-diagonal inverse matrix. 

Proposition 22 (Quasi-diagonal gradient step). 

Let A be the quasi-diagonal reduction of A. Let b = (bo, ■ ■ ■ , h, . . .) and 
w = A~ 1 b. Then w is given by 

bo Aoi . . 

^0 = ~. 2^ -J~ W i ( 80 ) 

Aoo ^ Aoo 

Thus, only the entries Aoo, Mi and Aoi of the original matrix A need to 
be known in order to implement gradient descent using the quasi-diagonal 
metric defined by A. 

Note that if the original matrix A is positive-definite, we have Aoo > 
and AooAn > A\ { (by the Cauchy-Schwarz inequality applied to the first 
and i-th basis vectors), so that the solution w above is well-defined and 
unique. 

2.4 Intrinsic gradients 

Using these intrinsic metrics allows to define an intrinsic gradient direction 
in parameter space. Given a dataset V of inputs x and corresponding targets 
y, the average loss function is 

L e := E xeV £ e (y) (81) 

where we put a subscript 8 to make explicit its dependency on the parameters 
of the network. Given an intrinsic metric ||-||, the differential 

of the average loss with respect to the full parameter set 6, defines a gradient 
direction VqL by the usual definition: it is the only tangent vector such that 
for any 56 we have 

Le+se = Le + (V fl L, 86) + O(||<50|| 2 ) (83) 



28 



where (•, •) is the scalar product associated with the norm |j • || . In a parametriza- 
tion where ||-|| 2 is g 
gradient is given by 



tion where ||-|| 2 is given by a symmetric, positive definite matrix A, the 



f)T 

V e L = A- 1 — = -A- 1 G (84) 
The gradient VqL is an intrinsic tangent vector on the parameter set. 
Definition 23. 

The (general) unitwise natural gradient, backpropagated metric gradient, 
quasi-diagonal natural gradient, and quasi-diagonal backpropagated metric 
gradient, respectively, are the following update rule for 9: 

9^9- 7]V e L (85) 

where VgL is the gradient of the average loss function L computed in the 
unitwise natural metric, backpropagated metric, and their quasi-diagonal 
reductions, respectively. 

The algorithms of Section Q] are the application of these updates to ordi- 
nary neural networks, written out with [0; l]-valued activities and sigmoid 
activation function. More details on how this works out are given below 
(Section [23]). 

This update is intrinsic only under all affine reparametrizations of the 
parameter 9. Indeed, even if the tangent vector V#L giving the direction 
of the gradient is fully intrinsic, adding a tangent vector to a parameter 9 
is not an intrinsic operation (if two parametrizations differ by a non-affine 
transformation, then the additions will not amount to the same). On the 
other hand, the ideal limit when the learning rate rj tends to is intrinsic: 
the trajectories of the differential equation 

^ = ~Vo it) L (86) 

are intrinsic trajectories in parameter space for the unitwise natural gradient 
and backpropagated metric. 

For the quasi-diagonal algorithms, invariance is always restricted to affine 
reparametrizations, since this is the setup in which they are well-defined. 



2.5 The Fisher matrix for neural networks 

Applying the general definitions above to ordinary neural networks with 
sigmoid activation function leads to the algorithms of Section [TJ This is 
mostly by direct computation and we do not reproduce it fully. Let us 
however discuss in more detail the case of the Fisher metric. 

For each input x, the network defines a probability distribution on the 
outputs y. This probability distribution depends on the parameters of the 
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network. Thus, for each input x, we can define a datum-wise Fisher matrix 
on the parameter set: 



PY . iTji d£(y) dl{y) 

F{x) WijWilj , = E y[x ^ ^ (87) 

The dataset together with the network define a probability distribution 
on pairs (x,y), by first choosing at random an input x in the dataset, then 
running the network on this input. The Fisher matrix associated with this 
distribution on pairs (x, y) is the average of the datum-wise Fisher matrix 
over the dataset 

F = E xeV F(x) (88) 
(see [ANOOj . Section 8.2), or more explicitly 

d£(y\x) d£(y\x) 

F w " E *^ E y\* dWij dWi , 3 , ( 89 > 

Exact Fisher matrix versus one-sample Fisher matrix. One possi- 
ble way to train neural networks using the natural gradient is to estimate 
the Fisher matrix by taking a random input x in the dataset, taking a ran- 
dom output y for this input, and add the term 9 g^ x ^ ^Qwl^ ^° ^ e curren t 
estimate of the Fisher matrix (with a discount factor for older contributions) . 
A variant uses for y the target value for input x [RMB07], thus using the 
empirical second moment of the gradients as a metric. We refer to this as 
the one-sample Fisher matrix or one-sample natural gradient. 

The one-sample natural gradient is intrinsic in these two variants. The 
second variant, contrary to the natural gradient, depends on the targets and 
not only on the network and dataset, and is thus not naturally interpreted 
as a Riemannian metric for neural networks. 

In Section SI we compare performance of the unitwise natural gradient 
and unitwise one-sample natural gradient. 

The one-sample natural gradient can give rise to a quasi- diagonal one- 
sample natural gradient in the same way as the other methods (Section! 



Exact Fisher matrix computation. Our framework allows to compute 
the exact Fisher matrix instead of using a single value for y, by using the 
Fisher modulus and backpropagation transfer rates. The latter can be com- 
puted by doing n out backpropagations for each input. This is of course more 
convenient than summing over the (in the Bernoulli case) 2 n ° ut possible out- 
comes for y. 

Remember the backpropagation transfer rates J^° ut from Definition [21 
The latter simply implements the general Definition [TU] for ordinary neural 
networks. In Section [TJ the unitwise natural gradient was obtained from 
these transfer rates through the Fisher modulus. Here we reproduce the 
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corresponding formula for all terms of the Fisher matrix, not only the terms 
of the unitwise Fisher matrix afferent to a given unit, so we introduce a 
Fisher modulus indexed by pairs of units. 

Proposition 24 (Exact Fisher matrix for neural networks). 

Let x be an input for the network. Compute the transfer rates J k out as in 
DeEnition [21 Depending on output layer interpretation, set for each pair of 
units k and k': 

T^out r^out 

: = £ fcout6 £ out ak k nX ) (Bernoulli) 

*-out \ "-out 1 

(x) := Ek out eC out 4° ut 4° ut (square-loss) 
< $kk>(x) := | E fcout e£ out 4 fc ° ut 4° ut (classihcation) 

~ ^ (Sfc out G£out a fcout4° Ut ) (SfcoutS^out a fcout^fc'° Ut ) 

where S := J2k <. a l , 

(90) 

Then the entry of the datum-wise Fisher matrix F{x) associated with 
parameters and Wj k ' (including biases with i = or j = 0) is 

F ( x )w lkWjk , = aiCLjdkO- ~ a fc)afc'(! - a k')®kk> (91) 
and consequently the same entry in the Fisher matrix is 

Fw ikWjk , = ^xav didjCikil - a k )a k '(l - a k ')^kk' (92) 

The proof is given in the Appendix and is a more or less straightforward 
application of the results of the previous section, together with an explicit 
computation of the Fisher metric on the output in the Bernoulli, square-loss, 
and classification interpretations. 

So it is possible to compute the full Fisher matrix by performing n out 
independent backpropagations for each sample input. The Fisher matrix F, 
being the average of F(x) over the dataset, may be approximated by the 
standard online or small-batch techniques using samples from the dataset. 

For a network with only one hidden layer, this simplifies and no addi- 
tional backpropagations are needed. Indeed, the backpropagation transfer 
rates of the input layer are never used, and on the hidden layer are given by 

4° ut = Wkk out a kout (l - a kout ) (93) 

from which the Fisher modulus can be immediately computed. This is the 
case treated in [Kur94] (for the Bernoulli interpretation). 
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3 Some properties of unitwise algorithms and their 
quasi-diagonal approximations 

3.1 Performance improvement at each step 

A common feature of all gradient-based algorithms in any metric is that 
the objective function improves at each step provided the learning rate is 
small enough. Consequently this holds for the unitwise natural gradient, 
backpropagated metric gradient, and their quasi-diagonal reductions. 

Proposition 25. 

Suppose that training has not reached a local optimum, i.e., that the gradient 
vector G of Section [7] does not vanish. Suppose that the metric considered 
is non-degenerate (i.e., respectively, that the matrices or are in- 
vertible, or that the denominators in the quasi-diagonal algorithms do not 
vanish), so that the algorithms considered are well-defined. Suppose that 
the chosen interpretation to of the output layer depends smoothly on the 
output layer activities. 

Then there exists a value rjc of the learning rate such that, for any 
learning rate r/ < rjc, the value of the loss function strictly decreases after 
one step of the unitwise natural gradient, backpropagated metric gradient, 
or their quasi-diagonal reductions. 

As usual, the value of rjc depends on the current state of the network 
and thus may change over the course of training. 

3.2 Invariance properties 

The algorithms presented in Section [T] are the implementation of the gradi- 
ents and metrics defined in SectionEJ written out using [0; l]-valued activities 
and the logistic activation function. We could have written them out, for in- 
stance, using [—1; l]-valued activities and the tanh activation function, and 
the learning trajectory would be the same — provided, of course, that the 
initialization was done so that both implementations of the network behave 
the same at startup. We present a more precise formulation of this property. 

Imagine that the inputs of the network are subjected to simple transfor- 
mations such as scaling (a^ ajOj for i in the input layer) or 0/1 inversion 
(a» <— 1 — aj). There is a simple way to change the parameters of subsequent 
units so that the final activation of the network stays the same, namely, 
Wij <— Wij/cti for scaling and W{j < Wij, WQj <— woj + for 0/1 inver- 
sion. So clearly the expressivity of a neural network is not sensitive to such 
changes. 

However, training will behave differently. For instance, if we apply one 
step of backpropagation training to the scaled inputs with the scaled net- 
work, the coefficiens of units which have been scaled down (oti < 1) will 
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evolve more slowly and conversely for on > 1. The final output of the 
network after the update will be different. (Hence the common practice 
of rescaling the activities of units.) The same goes for 0/1 inversion in a 
slightly more complicated way: evolution of the bias depends on the activity 
of input units, and the weights from input units with activity close to will 
evolve faster than those with activity close to 1, as seen on Figure [TJ 

We would like the following invariance for a training procedure: If we 
start with two networks iV and N' which are fed inputs x and x' with x' 
obtained from a simple transformation of x, and if the parameters of N' 
are set such that initially its output is the same as N, then we would like 
the outputs of N and N' to stay the same after one step of the training 
procedure. 

This is not satisfied by backpropagation. However, for any affine trans- 
form of the activities of any unit, this is satisfied by the natural gradient, 
unitwise natural gradient, backpropagated metric gradient, and their quasi- 
diagonal reductions. 

The sigmoid and tanh networks correspond to each other by the following 
rewriting, thanks to tanh(x) = 2sigm(2x) — 1: if a k = sigm(J2i^ k Wi k ai + 
WQk) and a' k = t&nh(J2i^.k w 'ik a i + w 'ok) ( an d interpretation of the output 
layer in the tanh case is done by putting back the activities in [0; 1] via 
a' i — y 1/2 + a' /2), then the two networks will behave the same if we set 
w ik = 4u4 (t / 0) and w ok = 2w' 0k - 2Ei/o«4- 

Definition 26. 

Let k be an input or internal unit. Call (a, (3, 7)-affine reparametrization of 
unit k the following operation: Replace the activation of unit k 

a k = /&(<*«,..., Oi„ h ) (94) 

where 6 k = (w ok , (w ik )i-+k), 

a'k = atf g > k (ai 1 ,...,a ink ) + P (95) 

where 9' k = (w' ok , {w' ik )i^k)- Send a' k instead of a k to the next layer of the 
network, with weights modihed as follows: 

w ' k j = Wkj/a, w' 0j = w 0j - w kj f3/a (96) 

for all units j such that k — > j, and w' ik = 10^/7 for all units i with i — >■ k (in- 
cluding i = 0), so that the final outputs before and after the reparametriza- 
tion are the same. 

The passage from sigm to tanh consists in applying the (2,-1, 2)-reparametrization 
to all units. We have restricted the definition to non-output units to simplify 
notation; for output units a corresponding reparametrization of the output 
interpretation has to be done. 
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The following is an immediate consequence of the intrinsic definition of 
the algorithms. It is only part of the invariance properties of the objects 
from Section [2 we limited ourselves to this case to simplify notation. 

Proposition 27 (Invariance under affine reparametriza- 
tion). 

Consider a network obtained from an initial network by applying any num- 
ber of (a, j3, 7) -affine reparametrizations to any number of units (where a, 
j3 and 7 may depend on the unit). 

Then, after one step of the unitwise natural gradient, backpropagated 
metric gradient, or their quasi- diagonal reductions, the final outputs of the 
non-reparametrized and reparametrized networks are the same. 

Consequently, the learning trajectories, and performance, of the two net- 
works with these corresponding initializations are the same. 

This may look obvious, but we should keep in mind that this property 
is not satisfied by backpropagation or quasi-Hessian methods. 

In particular, these algorithms are insensitive to shifting and scaling of 
all units in the network. Traditionally, it is recommended to normalize the 
activities on input units so that they average to over the dataset and have 
a prescribed variance: the algorithms here automatically do the same in an 
implicit way, for all (not only input) units. As a consequence, units with 
low activation levels get updated as fast as highly activated units. (Note 
that as discussed after the definition of the quasi-diagonal algorithms, these 
averages and variances are computed according to non-uniform weights on 
the dataset given by the Fisher modulus or backpropagated modulus.) 

Still the invariance above only applies if the two networks considered have 
corresponding initializations. For instance, if the initial weights are random 
with a variance set to 1 whatever the data, obviously the initial behavior 
of the network will be sensitive to scaling of its input. So these methods 
do not remove the need for traditional recommendations for initializing the 
weights (either by normalizing the data and then taking weights of size 1, 
or by taking initial weights depending on the variance or covariance matrix 
of the data). 

The unitwise natural gradient and the backpropagated metric gradient 
(but not their quasi-diagonal reductions) have a further, more interesting 
invariance property: invariance under affine recombination of the signals a 
unit receives from its various afferent units. For instance, if we start with 
zero weights, an internal unit will evolve in the same way if it receives / and 
/ + eg (where / and g are seen as function of the input x) as if it receives 
/ and g. This is especially useful if g is correlated to the desired output. 

Proposition 28 (Invariance under affine recombination of 
incoming signals). 
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Consider a neural network and define a new one in the following way. Let 
k be a non-input unit in the network, with afferent units, and let ip : 
M fc — > M. k be an invertible afhne map. Define a new network by replacing 
the activation function at unit k 

ak = /i(ai 1 ,---,Oi n J (97) 

with 

still parametrized by Ok, where (p* (Ok) results from applying the dua® inverse 
afhne transformation <p>* to Ok , so that initially the responses of the original 
and reparametrized networks are the same. 

Then, after one step of the unitwise natural gradient or backpropa- 
gated metric gradient with respect to Ok, the final outputs of the non- 
reparametrized and reparametrized networks are the same. 

Consequently, the learning trajectories, and performance, of the two net- 
works with these corresponding initializations are the same. 

Once more, this is not simply ip in one place cancelling out ip _1 in an- 
other: indeed, backpropagation or quasi-Hessian methods do not have this 
property, and neither do the quasi-diagonally-reduced algorithms. 

Proof. This comes as a consequence of the best-fit interpretation (Proposi- 
tion CUD below. 

It also follows from the intrinsic constructions by noting that, unlike the 
quasi-diagonal reductions, the construction of these gradients never breaks 
down the n^-tuple of incoming activities into its components from each af- 
ferent unit; thus, contrary to the quasi-diagonal reductions, we could have 
written the unitwise natural and backpropagated metrics in a setting where 
activation functions are given by = f% (g(ai 1 , . . . , ai nh )) where g is a fixed, 
parameterless map with values in a manifold. □ 

3.3 Best-fit interpretation 

The unitwise natural gradient and the backpropagated metric gradient (but 
not their quasi-diagonal reductions) both have an interpretation as a least- 
squares regression problem at each unit. Namely, the backpropagated value 
bk(x,y) on input x and target y indicates how the activity of unit k should 
change on input x. Seeing bk as a function of the input x, unit k has to use 
the activities of afferent units i (also seen as functions of x) and combine 
them using the weights u^, to match bk(x,y) as close as possible for each 

6 9k is an affine form over the n^-tuple of afferent activities. </?* is defined, axiomatically, 
by the property that applying ip*(9k) to the activities transformed by <p, is the same as 
applying 8k to the untransformed activities. Decomposing 8k = (lUofe, (wik)i->k), the affine 
matrix defining ip* is the transpose of the inverse of the affine matrix defining ip. 
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x. This idea is presented in [Kur94j in a more specific setting. This is 
also relevant to the behavior of the algorithms when the matrices F and M 
defining the metrics are singular or close to singular, as we will see. 

Proposition 29 (Intrinsic gradients as best fit to b). 

Let k be a non-input unit in the network. For x in the dataset T>, let bk(x) 
be the backpropagated value ([3D obtained on input x and the corresponding 
target y. 

Consider the solution A = (Aj) to the following weighted least-squares 
problem: 

where i runs over the units afferent to k (including i = with ao = 1), &k(x) 
is the Fisher modulus (Definition 0), and the weights are 

W x := a fc (x) 2 (l - a fc (x)) 2 $ fc (x) (100) 

Then the unitwise natural gradient step (|16|) is given by A, namely, the 
update is Wik <— wo- + ??Aj at each unit k. 

The same holds for the backpropagated metric gradient using the back- 
propagated modulus rrik (Definition]?^ instead of the Fisher modulus 3>fc. 

Thus, the gradient step depends on the linear span of the afferent activi- 
ties (ai(x))i-+k, seen as functions over the dataset (which, by the way, proves 
Proposition 1281 above). This is why the gradient step is the same whether 
the unit receives signals f(x) and g(x) or f(x) and f{x) + eg(x). Thus, these 
algorithms perform an implicit orthonormalization of the incoming signals 
at each unit (not only input units) in the network. 

Proof. A direct application of the well-known formula for the solution of the 
weighted least-squares problem (fWj) . with the choice of weight (|100p . yields 
exactly the updates (fT5|) and ([25]) . □ 

Non-invertibility and regularization of the matrices. In several sit- 
uations the matrices F and M used to define the unitwise natural update 
and backpropagated metric update can be singular. 

This is the case, for instance, if one input unit is uniformly set to over 
all elements in the dataset: obviously such a unit is not informative, and the 
corresponding term will vanish both in the metric and in the gradient. This 
is also the case when, e.g., two units afferent to the same unit are perfectly 
correlated. Correlation in the activation profiles happens systematically in 
case the size of the dataset is smaller than the number of parameters afferent 
at a given unit. 
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The linear regression viewpoint limits, in theory, the seriousness of these 
issues: this only means the linear regression problem has several solutions 
(one can add any quantity to a non-informative weight), and any of them 
will do as an update. Indeed, for instance, the matrix M in Definition 
is of the form AA T , and M~ l is applied to the vector G which is of the 
form AY, thus G always lies in the image of A and thus the linear system is 
underdetermined, not over determined. From the gradient ascent viewpoint 
this means the matrix M will be singular but the gradient term dL/dw will 
vanish in the corresponding directions. 

Numerically, however, the issue must be dealt with. One can use the 
Moore-Penrose pseudoinverse of M (or F in the Fisher matrix case), ob- 
tained by adding e. Id to M or to F with very small e. This is a standard 
regularization technique. It has the advantage of producing a well-defined 
update when e — > 0, asymptotically independent of e. 

Thus, if a formal definition is needed, one can decide to use the Moore- 
Penrose pseudoinverse for M~ l and F~ l in the definition of the updates. 
However, this formally breaks the invariance properties: the Moore-Penrose 
pseudoinverse selects, among the several possible solutions (Aj) to an under- 
determined least squares problem, the one with smallest norm J^Af, an d 
this is not intrinsic. 

4 An experimental comparison 

To test the influence of the different methods, we chose a very simple problem 
in which a perfect solution is expected to be found. A sparsely connected 
network with 5 layers of size 100, 30, 10, 30, and 100 was built, and 16 
random length-100 binary strings were fed to the input layer, with the target 
equal to the input. Ideally the network learns to encode each of the 16 
samples using 4 bits on the middle layer (thus with room to spare) and uses 
the bottom layer parameters to rewrite the output from this. 

The sparsely connected network is built at random in each instance as 
follows. Each of the 100 units in the input layer is linked to 5 randomly 
selected nodes in the first hidden layer. Each of the 30 units in the first 
hidden layer is linked to 5 random nodes in the middle hidden layer. The 
scheme is reversed for the bottom part of the model: each of the 100 output 
units is linked to 5 random nodes in the last hidden layer, and each unit in 
the last hidden layer is linked to 5 random nodes of the middle hidden layer. 

For each instance, the dataset is made of 16 random binary strings of 
length 100. The target for each input is identical to the input. We use 
Bernoulli interpretation of the output. 

Note that this setting is adverse for the unitwise and quasi-diagonal nat- 
ural gradients, which require a small output layer; this must be remembered 
in the comparisons below. 
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To test the influence of parametrization on non-invariant algorithms, and 
to check invariance of the invariant ones, each algorithm was implemented 
both using sigm(x) and tanh(x) as the activation function. 

The methods tested are: backpropagation; unitwise natural gradient; 
quasi-diagonal natural gradient; unitwise one-sample natural gradient with 
current target (see Section 12, 5|) ; backpropagated metric gradient; quasi- 
diagonal backpropagated metric gradient; diagonal Gauss-Newton (equiva- 
lent to keeping only the diagonal terms in the quasi-diagonal backpropagated 
metric gradient). 

Since the sample size is small, the algorithms were run in batch mode. 

Regularization. The algorithms were taken directly from Section [TJ To 
all methods except backpropagation, we added a regularization term of 
10~ 4 Id to the various matrices involved, to stabilize numerical inversion. 
This value is not so small; values such as 10 -7 seemed to affect performance. 
This is probably due to the small sample size (16 samples): each sample con- 
tributes a rank-1 matrix to the various metrics. Larger sample sizes would 
probably need less regularization. 

Initialization. For the tanh activation function, all the weights were ini- 
tialized to a centered Gaussian random variable of variance 1, and the biases 
set to 0. For the sigmoid activation function, the initialization was the cor- 
responding one (using Eqs. 1401 and I41|) so that initially the responses of the 
networks are the same: namely, each weight was set to a centered Gaus- 
sian random variable of variance 4, and then the bias at unit k was set to 
— J2ii^>k w ik/2- This initialization has the property that if the incoming 
signals to a unit are independent, centered about 1/2 (sigmoid) or (tanh) 
and of variance a with a not too large, then the output of the unit is also 
centered of variance ~ a. (The factor 4 in the sigmoid case compensates for 
the derivative 1/4 of the sigmoid function at 0.) 

Learning rate. A simple adaptive method was used for the learning rate. 
All methods based on gradients in a metric have a guarantee of improvement 
at each step if the learning rate is small enough. So in the implementation, 
if a step was found to make the loss function worse (in a batch mode, thus 
summed over all samples), the step was cancelled and the learning rate 
was divided by 2. If the step improves the loss function, the learning rate 
is increased by a factor 1.1. The initial learning rate was set to 0.01; in 
practice the initial value of the learning rate is quickly forgotten and has 
little influence. 

Unfortunately this scheme only makes sense in batch mode, but it has the 
advantage of automatically selecting learning rates that suit each method, 
thus placing all methods on an equal footing. 
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Method Number of iterations 



Backpropagation (sigmoid) 


10,000 


Backpropagation (tanh) 


10,000 


Unitwise natural gradient 


2,100 


to 


2,300 


Quasi-diagonal natural gradient 


2,800 


to 


3,100 


Backpropagated metric 


4,200 


to 


4,300 


Quasi-diagonal backpropagated metric 


7,400 


to 


7,500 


Diagonal Gauss-Newton 


7,700 


to 


7,800 


Unitwise one-sample natural gradient 


4,000 


to 


4,100 



Table 1: Number of iterations resulting in approximately equal execution 
times for our problem 

Execution time and number of iterations. First, 10, 000 steps of back- 
propagation were performed on the whole dataset, in batch mode. The re- 
sulting running time was set aside and converted to an equivalent number 
of iterations for all of the other algorithms. This is a very rough way to pro- 
ceed, since the running times can depend on the implementation details^, 
and vary from run to run (because floating point operations do not take the 
same time depending on the numbers they are operating on, especially when 
both very small and very large values are involved. 

Most of all, the different methods scale in different ways with the network, 
and so the network used here may not be representative of other situations. 
In particular this auto-encoder setting with 100 output units puts the unit- 
wise natural gradient and quasi-diagonal natural gradient at a disadvantage 
(since on the same time budget they must performed n ou t backpropagations 
per sample), compared to, e.g., a classification setting. 

Nevertheless, we give in Table 2] the numbers of iterations giving roughly 
equal running time for each method. 

The natural gradient was also tested (using the exact full Fisher matrix 
as obtained from Proposition I24|) . The computational cost is very high and 
only 10 iterations take place in the alloted time, too few for convergence. 
Thus we do not report the associated results. 

Results. In TableEl we report the average loss per sample, in bits, and its 
standard deviation, at the end of the allocated number of training iterations. 
These values can be interpreted as representing the average number of bits 
of an output that the model did not learn to predict correctly (out of 100). 
The results of the implementations using sigmoid and tanh activation are 
reported separately. 

Performance as a function of time is plotted in Figure El 

7 We tried to implement each method equally carefully. 
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Method 



Average loss (bits) ± std-dev 



sigmoid tanh 



Backpropagation 

Unitwise natural gradient 

Quasi-diagonal natural gradient 

Backpropagated metric 

Quasi-diagonal backpropagated metric 

Diagonal Gauss-Newton 

Unitwise one-sample natural gradient 



35.9 ±2.1 24.7 ±2.2 

0.9 ±1 1.4 ±1.8 

3.5 ±1.2 3.4 ±1.6 

0.8 ±0.8 0.3 ±0.5 

I. 9 ±1.2 1.5 ±1.3 

II. 6 ±2.5 3.5 ±2.0 
24.7 ± 2.5 28.5 ±3.4 



Table 2: Average loss per sample (bits) after an execution time equivalent 
to 10, 000 backpropagation passes, computed over 20 independent runs 

The statistics were made using 20 independent runs for each method. 

Interpretation. These results are mainly illustrative: the situation con- 
sidered here may not be representative because of the small sample size and 
network size involved. 

Still, it is clear that for problems of this size, the more elaborate algo- 
rithms are very competitive. Only the tanh implementation of the diagonal 
Gauss-Newton method comes close to the invariant algorithms (while its 
performance in sigmoid implementation is not as good). 

As can be expected, the invariant algorithms have similar performance 
in sigmoid or tanh implementation: trajectories match each other closely 
(Figure^). The variations are caused, first, by random initialization of the 
dataset and weights in each run, and second, by the inclusion of the regu- 
larization terms eld, which breaks invariance. If the effect of the latter is 
isolated, by having the same initialization in tanh and sigmoid implementa- 
tions, the trajectories coincide for a the first few iterations but then start to 
differ; however, overall performance is not affected. 

In this setting, the natural gradient, in its unitwise and quasi-diagonal 
versions, seems to perform slightly worse than the backpropagated metric 
methods. Remember, however, that this is an effect of the large output 
layer size combined with a given computation time budget. Per iteration 
instead of computation time, the natural gradient methods perform better 
than the backpropagated metric methods, and we expect them to be more 
competitive for smaller output sizes, e.g., in classification tasks. 

The relatively bad performance of the unitwise one-sample natural gra- 
dient is a surprise. This method is invariant, although not naturally inter- 
preted as an intrinsic Riemannian metric. Analysis of the runs shows that 
the learning rates (adjusted as described above) decrease steadily through 
optimization. This may be related to poor invertibility of the matrices 
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Computation time (arbitrary units) 



Figure 3: Performance over time of all algorithms involved. For better 
readability the trajectories of the invariant algorithms have been plotted only 
in tanh implementation (Figure El shows them in sigmoid implementation for 
completeness) . 
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Figure 4: Comparison of the trajectories of the invariant algorithms in tanh 
and sigmoid implementations 
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involved. An indication in this direction comes from the difference in perfor- 
mance in the sigmoid and tanh implementation: this algorithm is invariant 
and the only difference can come from (non-invariant) regularization kicking 
in. Low rank of the matrices may in turn be related to the large dimension- 
ality of the output layer (in contrast to [RMB07] ) : Indeed, reasoning on the 
full (whole-network) Fisher matrix, the one-sample natural gradient con- 
tributes a matrix of rank 1 for each data sample; on the other hand, the 
exact datum-wise Fisher matrix contributes a sum of n out matrices of rank 1 
for each data sample as can be seen from (|91|) . Thus from a theoretical view- 
point the quality of the one-sample approximation likely depends on output 
dimensionality. (The small size of the dataset in our setting is not enough 
to cause this problem, as it does not seem to affect the other methods.) 

Lack of invariance is striking for some algorithms, such as the diago- 
nal Gauss-Newton method, the performance of which is very different in 
the sigmoid and tanh interpretations (Figure [5]) . The quasi-diagonal back- 
propagated metric method only differs from diagonal Gauss-Newton by the 
inclusion of a small number of non-diagonal terms in the update. This 
change brings the sigmoid and tanh implementations in line with each other, 
improving performance with respect to the best of the two Gauss-Newton 
implementations. In settings where the activities of units (especially internal 
units, since the input can always be centered) are not as well centered as here, 
we expect the quasi-diagonal backpropagated metric method to outperform 
the tanh diagonal Gauss-Newton implementation even more clearly. So, ar- 
guably, the quasi-diagonal backpropagated metric is "the invariant way" to 
write the diagonal Gauss-Newton algorithm. 

In conclusion, although this experiment is a small-scale one, it clearly 
emphasizes the interest of using invariant algorithms. 

Appendix: Proof of Proposition 1241 

The Fisher metric depends, of course, on the interpretation of the output 
layer as a probability distribution uj. The final output a; is a probability 
distribution over the values of y in the target space, parametrized by the 
activities at of the output units k. If the output activities change by 8a k , 
the probability distribution u will change as well. The norm of this change 
in Fisher metric is 



MIL = E^((51na;(y)) 2 



(101) 




k,k'€C out 



E 




5ak5dk< 



(102) 



thus stemming from the matrix 



hk' '■— 



E 



d In uj(y) dlnu(y) 



(103) 



da k da k , 
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Computation time (arbitrary units) 

Figure 5: Effect of introducing a few non-diagonal terms: Comparison of 
the diagonal Gauss-Newton and the quasi-diagonal backpropagated metric 
methods 



over the output layer. 

In the Bernoulli interpretation, for each component k of the output layer, 
the random variable is a Bernoulli variable with parameter a&. In the 
square-loss interpretation, each is a Gaussian random variable with mean 
afc and variance 1. In the classification interpretation, y is a discrete random 
variable which takes value k with probability ujk = &&/(X^g£ out a f)- 

Let us compute the Fisher metric in the space u in each case. In the 
Bernoulli case, we have ui(y) = Y\k^c OVLt ^-yk=^ a k + ly fc =o(l — o-k))- Conse- 
quently 

dlnujjy) = t yk=1 _ t yk=0 = yj-a k (104) 
dak a>k 1 — ojfc a k (l - a^) 

Since under the distribution uj we have Ey^ = a k and Vary/% = a^(l — a k ) 
(with yk and yj independent for k ^ j) we find 

hk' = l k=k ' ; (105) 
- aft) 

for A; and A;' in the output layer. 

In the Gaussian case we have u(y) = ]J k e {Vl 1 so that 



yj — ook- Since under the distribution lj we have Ey/u = uj^ and Vary^ = 1 
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we find I = Id hence 

hk' = t k= y (106) 

for k and k! in the output layer. 

In the classification case the probability to have y = j is uj(y) = aj/S 

with S = Eiec out al Thus dhiu(y)/du k = 2(±g* - f ). Taking the 
expectation over y we find 



l kk' 



«Efpf-f)(^-f) M 
= 4^1^ J_i fc _ fc; _ 4^^^ _ S^—— + A\ y ( ^\—— 

Sa k a k , k = k ' Sa k S S a k , S \f S ) S S 

(108) 

= ^t k=k 1^ (109) 

These give the expression of the Fisher matrix I kk > for k and k' in the 
output layer. This is enough to compute the full Fisher matrix, as follows. 

Let x be an input for the network. Given a variation 59 of the network 
parameters 9, let 5ai be the resulting variation of unit i, and let 5u be the 
resulting variation of the final output uj. We have 5u = J2kec ut Jaj~^ afc ' 
The datum-wise Fisher metric on 9 is 



li»liL = HML ( n °) 

= E I kk >Sa k 5a k > (111) 

out 

For each k in the output layer, we have 5a k = J2i 7%f~3@i where the sum 
runs over all units i in the network. For each i we have = |^ ^ = jf ^ . 
Plugging this into the above yields 



y da; da; 



iL = EE E E ^"^wM (112) 

so that the term of the Fisher matrix corresponding to 59 i and 59 ^ is 

l^keCout l^k'eCout 1 kk>'Ji gg. gg., ■ 

For standard neural networks we have 59i = (5wji)jj^i and the sigmoid 



a; 



activation function yields -g-^- = ajai(l 

Plugging into this the expression for I kk i yields the results in Proposi- 
tion El 
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